:orphan:

Sklearn Basics 1: Train, Evaluate and Deploy a Classifier
=========================================================

In this lesson, we will learn how to train, evaluate and deploy a
classifier with Khiops sklearn.

We start by importing the sklearn estimator ``KhiopsClassifier``:

.. code:: ipython3

    import os

    import pandas as pd
    from khiops import core as kh
    from khiops.sklearn import KhiopsClassifier

    # If there are any issues, you may check the Khiops status with the following command
    # kh.get_runner().print_status()

Training a Classifier
---------------------

We’ll train a classifier for the ``Iris`` dataset. This is a classical
dataset containing data of different plants belonging to the genus
*Iris*. It contains 150 records, 50 for each of the three *Iris*
variants: *Setosa*, *Virginica* and *Versicolor*. Each record contains
the length and the width of both the petal and the sepal of the plant.
The standard task, when using this dataset, is to construct a classifier
for the type of the *Iris*, based on the petal and sepal
characteristics.

To train a classifier with Khiops, we only need a dataframe, which we
are going to load from a file. Let’s first save the location of this
file into a variable ``iris_data_file``, load it and take a look at its
content:

.. code:: ipython3

    iris_data_file = os.path.join(kh.get_samples_dir(), "Iris", "Iris.txt")
    iris_df = pd.read_csv(iris_data_file, sep="\t")
    print("Iris data: first 5 records")
    iris_df.head()

.. parsed-literal::

    Iris data: first 5 records

.. parsed-literal::

       SepalLength  SepalWidth  PetalLength  PetalWidth        Class
    0          5.1         3.5          1.4         0.2  Iris-setosa
    1          4.9         3.0          1.4         0.2  Iris-setosa
    2          4.7         3.2          1.3         0.2  Iris-setosa
    3          4.6         3.1          1.5         0.2  Iris-setosa
    4          5.0         3.6          1.4         0.2  Iris-setosa

Before training the classifier, we split the data into the feature
matrix (sepal length, sepal width, etc.) and the target vector
containing the labels (the ``Class`` column).

.. code:: ipython3

    X_iris_train = iris_df.drop("Class", axis=1)
    y_iris_train = iris_df["Class"]

Let’s check the contents of the feature matrix and the target vector:

.. code:: ipython3

    print("Features of the Iris dataset:")
    display(X_iris_train.head())
    print("")
    print("Label of the Iris dataset:")
    display(y_iris_train.head())

.. parsed-literal::

    Features of the Iris dataset:

.. parsed-literal::

       SepalLength  SepalWidth  PetalLength  PetalWidth
    0          5.1         3.5          1.4         0.2
    1          4.9         3.0          1.4         0.2
    2          4.7         3.2          1.3         0.2
    3          4.6         3.1          1.5         0.2
    4          5.0         3.6          1.4         0.2

.. parsed-literal::

    Label of the Iris dataset:

.. parsed-literal::

    0    Iris-setosa
    1    Iris-setosa
    2    Iris-setosa
    3    Iris-setosa
    4    Iris-setosa
    Name: Class, dtype: object

Let’s now train the classifier with the ``KhiopsClassifier`` estimator.
Its ``fit`` method returns a model ready to classify new Iris plants.

**Note:** By default Khiops builds 10 decision trees. This is not
necessary for this tutorial, so we set ``n_trees=0``.

.. code:: ipython3

    khc_iris = KhiopsClassifier(n_trees=0)
    khc_iris.fit(X_iris_train, y_iris_train)

.. parsed-literal::

    KhiopsClassifier(n_trees=0)

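
The feature/target split used before training can be pictured in plain Python. The following is a minimal sketch, not Khiops or pandas code: the inline ``IRIS_SAMPLE`` string and the helper ``split_features_target`` are illustrative names, and the sample simply copies the five records shown earlier in this section, assuming they are tab-separated as in ``Iris.txt``.

```python
import csv
import io

# Five records copied from the output shown earlier, tab-separated as in Iris.txt.
IRIS_SAMPLE = (
    "SepalLength\tSepalWidth\tPetalLength\tPetalWidth\tClass\n"
    "5.1\t3.5\t1.4\t0.2\tIris-setosa\n"
    "4.9\t3.0\t1.4\t0.2\tIris-setosa\n"
    "4.7\t3.2\t1.3\t0.2\tIris-setosa\n"
    "4.6\t3.1\t1.5\t0.2\tIris-setosa\n"
    "5.0\t3.6\t1.4\t0.2\tIris-setosa\n"
)


def split_features_target(rows, target):
    """Return (X, y): X keeps every column except `target`, y keeps only `target`."""
    X = [{name: value for name, value in row.items() if name != target} for row in rows]
    y = [row[target] for row in rows]
    return X, y


# csv.DictReader yields one dict per record; all values are read as strings here.
rows = list(csv.DictReader(io.StringIO(IRIS_SAMPLE), delimiter="\t"))
X_sample, y_sample = split_features_target(rows, "Class")

print(list(X_sample[0]))  # feature names only, "Class" is gone
print(y_sample)           # one label per record
```

This mirrors what ``iris_df.drop("Class", axis=1)`` and ``iris_df["Class"]`` do on the dataframe: the features and the labels stay aligned row by row, which is what ``fit`` expects.
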
Exercise
~~~~~~~~

We’ll repeat the same steps with the ``Adult`` dataset. It contains
characteristics of an adult population in the USA such as age, gender
and education. The task here is to predict the variable ``class``, which
indicates if the individual earns ``more`` or ``less`` than 50,000
dollars.

Let’s start by loading the ``Adult`` dataframe and checking its
contents:

Load the adult dataset and take a look at its content
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: ipython3

    adult_data_file = os.path.join(kh.get_samples_dir(), "Adult", "Adult.txt")
    adult_df = pd.read_csv(adult_data_file, sep="\t")
    print("Adult data: first 5 records")
    adult_df.head()

.. parsed-literal::

    Adult data: first 5 records

.. parsed-literal::

       Label  age         workclass  fnlwgt  education  education_num  \
    0      1   39         State-gov   77516  Bachelors             13
    1      2   50  Self-emp-not-inc   83311  Bachelors             13
    2      3   38           Private  215646    HS-grad              9
    3      4   53           Private  234721       11th              7
    4      5   28           Private  338409  Bachelors             13

           marital_status         occupation   relationship   race     sex  \
    0       Never-married       Adm-clerical  Not-in-family  White    Male
    1  Married-civ-spouse    Exec-managerial        Husband  White    Male
    2            Divorced  Handlers-cleaners  Not-in-family  White    Male
    3  Married-civ-spouse  Handlers-cleaners        Husband  Black    Male
    4  Married-civ-spouse     Prof-specialty           Wife  Black  Female

       capital_gain  capital_loss  hours_per_week native_country class
    0          2174             0              40  United-States  less
    1             0             0              13  United-States  less
    2             0             0              40  United-States  less
    3             0             0              40  United-States  less
    4             0             0              40           Cuba  less

Build the feature matrix and the target vector to train the ``Adult`` classifier
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Note that the name of the target variable is ``class`` (**in lower
case!**).

.. code:: ipython3

    X_adult_train = adult_df.drop(["class"], axis=1)
    y_adult_train = adult_df["class"]
    print("Adult dataset feature matrix (first 10 rows):")
    display(X_adult_train.head(10))
    print("Adult dataset target vector (first 10 values):")
    display(y_adult_train.head(10))

.. parsed-literal::

    Adult dataset feature matrix (first 10 rows):

.. parsed-literal::

       Label  age         workclass  fnlwgt  education  education_num  \
    0      1   39         State-gov   77516  Bachelors             13
    1      2   50  Self-emp-not-inc   83311  Bachelors             13
    2      3   38           Private  215646    HS-grad              9
    3      4   53           Private  234721       11th              7
    4      5   28           Private  338409  Bachelors             13
    5      6   37           Private  284582    Masters             14
    6      7   49           Private  160187        9th              5
    7      8   52  Self-emp-not-inc  209642    HS-grad              9
    8      9   31           Private   45781    Masters             14
    9     10   42           Private  159449  Bachelors             13

              marital_status         occupation   relationship   race     sex  \
    0          Never-married       Adm-clerical  Not-in-family  White    Male
    1     Married-civ-spouse    Exec-managerial        Husband  White    Male
    2               Divorced  Handlers-cleaners  Not-in-family  White    Male
    3     Married-civ-spouse  Handlers-cleaners        Husband  Black    Male
    4     Married-civ-spouse     Prof-specialty           Wife  Black  Female
    5     Married-civ-spouse    Exec-managerial           Wife  White  Female
    6  Married-spouse-absent      Other-service  Not-in-family  Black  Female
    7     Married-civ-spouse    Exec-managerial        Husband  White    Male
    8          Never-married     Prof-specialty  Not-in-family  White  Female
    9     Married-civ-spouse    Exec-managerial        Husband  White    Male

       capital_gain  capital_loss  hours_per_week native_country
    0          2174             0              40  United-States
    1             0             0              13  United-States
    2             0             0              40  United-States
    3             0             0              40  United-States
    4             0             0              40           Cuba
    5             0             0              40  United-States
    6             0             0              16        Jamaica
    7             0             0              45  United-States
    8         14084             0              50  United-States
    9          5178             0              40  United-States

.. parsed-literal::

    Adult dataset target vector (first 10 values):

.. parsed-literal::

    0    less
    1    less
    2    less
    3    less
    4    less
    5    less
    6    less
    7    more
    8    more
    9    more
    Name: class, dtype: object

Train a classifier for the ``Adult`` dataset
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Do not forget to set ``n_trees=0``.

.. code:: ipython3

    khc_adult = KhiopsClassifier(n_trees=0)
    khc_adult.fit(X_adult_train, y_adult_train)

.. parsed-literal::

    KhiopsClassifier(n_trees=0)

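
Before fitting, it can be useful to look at how the labels of the target vector are distributed. The snippet below is a minimal sketch using only the ten target values printed in this exercise; the list ``y_head`` is copied by hand for illustration and is not representative of the whole dataset. On the full vector, the pandas call ``y_adult_train.value_counts()`` should give the corresponding counts.

```python
from collections import Counter

# First ten values of the Adult target vector, copied from the output of this exercise.
y_head = ["less"] * 7 + ["more"] * 3

# Count each label and turn the counts into shares of the sample.
label_counts = Counter(y_head)
label_shares = {label: count / len(y_head) for label, count in label_counts.items()}

for label, count in label_counts.most_common():
    print(f"{label}: {count} ({label_shares[label]:.0%})")
```
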
Accessing the Classifier’s Basic Train Evaluation Metrics
---------------------------------------------------------

Khiops calculates evaluation metrics for the training dataset. We access
them via the model’s attribute ``model_report_``, which is an instance of
the ``AnalysisResults`` class. Let’s check this out:

.. code:: ipython3

    iris_results = khc_iris.model_report_
    print(type(iris_results))

.. parsed-literal::

    <class 'khiops.core.analysis_results.AnalysisResults'>

The model evaluation report is stored in the ``train_evaluation_report``
attribute of ``iris_results``.

.. code:: ipython3

    iris_train_eval = iris_results.train_evaluation_report
    print(type(iris_train_eval))

.. parsed-literal::

    <class 'khiops.core.analysis_results.EvaluationReport'>

We access the default predictor’s metrics with the
``get_snb_performance`` method of ``iris_train_eval``:

.. code:: ipython3

    iris_train_performance = iris_train_eval.get_snb_performance()
    print(type(iris_train_performance))

.. parsed-literal::

    <class 'khiops.core.analysis_results.PredictorPerformance'>

The object ``iris_train_performance`` is of class
``PredictorPerformance`` and has ``accuracy`` and ``auc`` attributes:

.. code:: ipython3

    print(f"Iris train accuracy: {iris_train_performance.accuracy}")
    print(f"Iris train AUC     : {iris_train_performance.auc}")

.. parsed-literal::

    Iris train accuracy: 0.96
    Iris train AUC     : 0.9914

The ``PredictorPerformance`` object also has a confusion matrix
attribute:

.. code:: ipython3

    iris_classes = iris_train_performance.confusion_matrix.values
    iris_confusion_matrix = pd.DataFrame(
        iris_train_performance.confusion_matrix.matrix,
        columns=iris_classes,
        index=iris_classes,
    )
    print("Iris train confusion matrix:")
    iris_confusion_matrix

.. parsed-literal::

    Iris train confusion matrix:

.. parsed-literal::

                     Iris-setosa  Iris-versicolor  Iris-virginica
    Iris-setosa               50                0               0
    Iris-versicolor            0               49               5
    Iris-virginica             0                1              45

To explore the results further, we can open the report with the Khiops
Visualization app:

.. code:: ipython3

    # To visualize uncomment the lines below
    # khc_iris.export_report_file("./iris_report.khj")
    # kh.visualize_report("./iris_report.khj")

Exercise
~~~~~~~~

Access the adult modeling report and print its type
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: ipython3

    adult_results = khc_adult.model_report_
    type(adult_results)

.. parsed-literal::

    khiops.core.analysis_results.AnalysisResults

Save the evaluation report of the ``Adult`` classification into the variable ``adult_train_eval``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: ipython3

    adult_train_eval = adult_results.train_evaluation_report

Show the model’s train accuracy, auc and confusion matrix
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: ipython3

    adult_train_performance = adult_train_eval.get_snb_performance()
    print(f"Adult train accuracy: {adult_train_performance.accuracy}")
    print(f"Adult train AUC     : {adult_train_performance.auc}")

    adult_classes = adult_train_performance.confusion_matrix.values
    adult_confusion_matrix = pd.DataFrame(
        adult_train_performance.confusion_matrix.matrix,
        columns=adult_classes,
        index=adult_classes,
    )
    print("Adult train confusion matrix:")
    adult_confusion_matrix

.. parsed-literal::

    Adult train accuracy: 0.869334
    Adult train AUC     : 0.925553
    Adult train confusion matrix:

.. parsed-literal::

           less  more
    less  35197  4424
    more   1958  7263

Deploying a Classifier
----------------------

We are now going to deploy the ``Iris`` classifier ``khc_iris``, which we
have just trained, on the same dataset (normally we would do this on new
data).

The learned classifier can be deployed in two different ways:

- to predict a class, using the ``predict`` method of the model;
- to predict class probabilities, using the ``predict_proba`` method of
  the model.

Let’s first predict the ``Iris`` labels:

.. code:: ipython3

    iris_predictions = khc_iris.predict(X_iris_train)
    print("Iris model predictions (first 10 values):")
    iris_predictions[:10]

.. parsed-literal::

    Iris model predictions (first 10 values):

.. parsed-literal::

    array(['Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa',
           'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa',
           'Iris-setosa', 'Iris-setosa'], dtype=object)

Let’s now predict the probabilities for each ``Iris`` type. Note that
the column order of this matrix is given by the estimator attribute
``classes_`` (here ``khc_iris.classes_``):

.. code:: ipython3

    iris_probas = khc_iris.predict_proba(X_iris_train)
    print(f"Iris classes {khc_iris.classes_}")
    print("Iris model probabilities for each class (first 10 rows):")
    iris_probas[:10]

.. parsed-literal::

    Iris classes ['Iris-setosa' 'Iris-versicolor' 'Iris-virginica']
    Iris model probabilities for each class (first 10 rows):

.. parsed-literal::

    array([[0.99730542, 0.00134729, 0.00134729],
           [0.99730542, 0.00134729, 0.00134729],
           [0.99730542, 0.00134729, 0.00134729],
           [0.99730542, 0.00134729, 0.00134729],
           [0.99730542, 0.00134729, 0.00134729],
           [0.99730542, 0.00134729, 0.00134729],
           [0.99730542, 0.00134729, 0.00134729],
           [0.99730542, 0.00134729, 0.00134729],
           [0.99730542, 0.00134729, 0.00134729],
           [0.99730542, 0.00134729, 0.00134729]])

Exercise
~~~~~~~~

Use the ``predict`` and ``predict_proba`` methods to deploy the ``Adult`` model ``khc_adult``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Which columns are deployed in each case?

.. code:: ipython3

    adult_predictions = khc_adult.predict(X_adult_train)
    print("Adult model predictions (first 10 values):")
    display(adult_predictions[:10])

    adult_probas = khc_adult.predict_proba(X_adult_train)
    print(f"Adult classes {khc_adult.classes_}")
    print("Adult model probabilities for each class (first 10 rows):")
    display(adult_probas[:10])

.. parsed-literal::

    Adult model predictions (first 10 values):

.. parsed-literal::

    array(['less', 'more', 'less', 'less', 'less', 'more', 'less', 'more',
           'more', 'more'], dtype=object)

.. parsed-literal::

    Adult classes ['less' 'more']
    Adult model probabilities for each class (first 10 rows):

.. parsed-literal::

    array([[9.99994845e-01, 5.15479465e-06],
           [4.05868754e-01, 5.94131246e-01],
           [9.61770510e-01, 3.82294902e-02],
           [9.12629478e-01, 8.73705223e-02],
           [5.62226618e-01, 4.37773382e-01],
           [2.32734078e-01, 7.67265922e-01],
           [9.93356522e-01, 6.64347792e-03],
           [4.24222870e-01, 5.75777130e-01],
           [1.79285954e-03, 9.98207141e-01],
           [5.17299187e-06, 9.99994827e-01]])

Open the training report with the Khiops Visualization app
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: ipython3

    # To visualize uncomment the lines below
    # khc_adult.export_report_file("./adult_report.khj")
    # kh.visualize_report("./adult_report.khj")
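
The figures reported in this lesson can be cross-checked with a few lines of plain Python. The sketch below recomputes the train accuracy as the diagonal of a confusion matrix divided by its total, and recovers labels from probability rows by taking the most probable class in ``classes_`` order. The helpers ``accuracy_from_confusion`` and ``labels_from_probas`` are illustrative names, not Khiops API. The matrices and the (rounded) probabilities are copied by hand from the outputs of this lesson, and the assumption that ``predict`` returns the most probable class is inferred from those outputs rather than taken from the Khiops documentation.

```python
def accuracy_from_confusion(matrix):
    """Accuracy = correctly classified records (diagonal) / all records."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def labels_from_probas(classes, probas):
    """For each row, pick the class whose probability is the highest."""
    return [classes[max(range(len(row)), key=row.__getitem__)] for row in probas]


# Confusion matrices copied from the train evaluation outputs of this lesson.
iris_matrix = [[50, 0, 0], [0, 49, 5], [0, 1, 45]]
adult_matrix = [[35197, 4424], [1958, 7263]]

iris_accuracy = accuracy_from_confusion(iris_matrix)
adult_accuracy = accuracy_from_confusion(adult_matrix)
print(f"Iris train accuracy : {iris_accuracy:.2f}")
print(f"Adult train accuracy: {adult_accuracy:.6f}")

# Class order as printed from khc_adult.classes_, with the first ten
# probability rows rounded to six decimals.
adult_class_order = ["less", "more"]
adult_probas_head = [
    [0.999995, 0.000005],
    [0.405869, 0.594131],
    [0.961771, 0.038229],
    [0.912629, 0.087371],
    [0.562227, 0.437773],
    [0.232734, 0.767266],
    [0.993357, 0.006643],
    [0.424223, 0.575777],
    [0.001793, 0.998207],
    [0.000005, 0.999995],
]
adult_labels_head = labels_from_probas(adult_class_order, adult_probas_head)
print(adult_labels_head)
```

Under these assumptions the recomputed accuracies agree with the reported ``0.96`` and ``0.869334``, and the ten recovered labels agree with the first ten values returned by ``khc_adult.predict``.
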